An Improved Analytical Superscalar Microprocessor Memory Model

Authors

Xi E. Chen

Tor M. Aamodt

Abstract

As the number of transistors that can be integrated onto a single chip continues to increase exponentially, a growing challenge is modeling performance with reasonable accuracy in the early stages of processor design. While methodologies for execution driven simulations are well understood, comparatively little is known about how to develop accurate analytical models. Processor architects in industry have occasionally employed ad hoc analytical modeling techniques in an attempt to rapidly focus the search for higher performance designs. Moreover, analytical models can provide insights that a detailed performance simulator may not. This paper proposes techniques to accurately model the performance impact of long latency data cache misses in a superscalar microprocessor. A pending data cache hit results from a memory reference to a cache block for which a request has already been initiated by another instruction but has not yet completed (i.e., the requested block is still on its way from memory). These pending cache hits have a non-negligible influence on accuracy of analytical models when analyzing memory intensive benchmarks. We propose a technique to quickly identify pending data cache hits and account for their effect on performance by analyzing memory reference patterns without performing detailed performance simulations. We also propose a novel profiling method to take account of the maximum number of outstanding cache misses supported by the memory system. Overall, these approaches improve performance prediction accuracy by a factor of 3.9 on average (error decreases from 39.7% to 10.3%) for a set of memory intensive benchmarks when the maximum number of outstanding misses supported is unlimited. Moreover, on average our model achieves 151 and 170 times speedup over detailed simulations with less than 10% error, when the maximum number of outstanding misses supported is sixteen and eight, respectively.
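The definition of a pending data cache hit given in the abstract can be illustrated with a small sketch. This is not the paper's profiling technique, which works from memory reference patterns without timing simulation. It is a toy classifier under loudly simplified assumptions: each access carries a hypothetical issue cycle, the miss latency is fixed, blocks are never evicted, and the number of outstanding misses is unlimited. The names `Access`, `classify_accesses`, `block_size`, and `miss_latency` are illustrative and do not come from the source.

```python
from collections import namedtuple

# Hypothetical trace record: the cycle an access issues and its byte address.
Access = namedtuple("Access", ["cycle", "address"])


def classify_accesses(trace, block_size=64, miss_latency=200):
    """Label each access as 'miss', 'pending_hit', or 'hit'.

    Assumptions (not from the paper): fixed miss latency, no evictions,
    and an unlimited number of outstanding misses.
    """
    ready_at = {}  # block number -> cycle at which its fill completes
    labels = []
    for access in trace:
        block = access.address // block_size
        if block not in ready_at:
            # First reference to this block: start a request to memory.
            ready_at[block] = access.cycle + miss_latency
            labels.append("miss")
        elif access.cycle < ready_at[block]:
            # Block already requested by an earlier access but not yet
            # returned: this is the pending hit described in the abstract.
            labels.append("pending_hit")
        else:
            # The fill completed before this access issued.
            labels.append("hit")
    return labels


trace = [
    Access(0, 0x1000),    # first touch of the block
    Access(5, 0x1008),    # same block, request still in flight
    Access(10, 0x2000),   # different block
    Access(300, 0x1010),  # same block as the first, after the fill returns
]
labels = classify_accesses(trace)
print(labels)
```

In this toy trace the second access is the interesting one: a model that treats it as an ordinary hit ignores the time it still spends waiting for the block, and a model that treats it as a new miss counts the memory latency twice.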


Similar resources

An application specific multi-port RAM cell circuit for register renaming units in high speed microprocessors

We present a novel custom circuit for superscalar microprocessor renaming unit and compare its performance with a conventional design, referring to an industrial 0.35 μm CMOS process. Speed and power consumption are significantly improved.


Instruction-Level Microprocessor Modeling of Scientific Applications

Superscalar microprocessor efficiency is generally not as high as anticipated. In fact, sustained utilization below thirty percent of peak is not uncommon, even for fully optimized, cache-friendly codes. Where cycles are lost is the topic of much research. In this paper we attempt to model architectural effect on processor utilization with and without memory influence. By presenting analytical ...


A Split Data Cache for Superscalar Processors

Superscalar implementations of RISC architectures are emerging as the dominant high-performance microprocessor technology for the mid-1990’s. This paper proposes and evaluates a split data cache memory design, a new memory system enhancement for superscalar processor architectures. This design allows floating-point and integer memory accesses to be executed in parallel. The configuration is wel...


Internal Organization of the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS RISC Microprocessor

A new CMOS microprocessor, the Alpha 21164, reaches 1,200 mips/600 MFLOPS (peak performance). This new implementation of the Alpha architecture achieves SPECint92/SPECfp92 performance of 345/505 (estimated). At these performance levels, the Alpha 21164 has delivered the highest performance of any commercially available microprocessor in the world as of January 1995. It contains a quad-issue, su...


The Mips R10000 superscalar microprocessor

The Mips R10000 is a dynamic, superscalar microprocessor that implements the 64-bit Mips 4 instruction set architecture. It fetches and decodes four instructions per cycle and dynamically issues them to five fully-pipelined, low-latency execution units. Instructions can be fetched and executed speculatively beyond branches. Instructions graduate in order upon completion. A...




Publication date: 2008
